1.1 I am rather excited to start this course, although I am a bit worried that the workload will be too much for me. I hope, though, that if I manage the workload, I will have the skills to start independently developing as a user of R and a contributor to open data science. It is especially the latter point that brought me to this course, although I did not actively look for it. I just happened to run into it on the course page of the University of Helsinki.
1.2 In what has been a rather unpleasant experience, GitHub did not immediately work for me. Nothing seemed to upload to my diary. Now, I think, I have managed to overcome my issues and get things to work most of the time. Learning by doing, I suppose!
To summarize:
- I want to learn open data science.
- I am worried the workload will be too much.
- I look forward to the course.
- I hope I will figure out the specifics of Git and RStudio.
Here’s the link to my GitHub repository.
Libraries:
library(dplyr)
library(ggplot2)
library(GGally)
This chapter analyses a selection of data from a 2014 survey of students participating in an introductory statistics course in Finland. The survey mapped students’ learning approaches and learning achievements. While the original data contained 183 observations of 60 variables, a more limited dataset of 166 observations of 7 variables will be employed here. These variables are the age and gender of the participants, their points from the course representing their performance, their attitude towards the course, and three variables mapping their learning styles. These learning styles are the “surface approach,” indicating memorization without deeper engagement; the “deep approach,” indicating an intention to maximize understanding of the subject matter; and the “strategic approach,” indicating an approach aimed at maximizing the student’s chances of a good grade. The variables “attitude,” “surface approach,” “deep approach,” and “strategic approach” are all aggregate mean measures of other variables. As such, each variable summarizes related observations into an average. This analysis used the below script, in combination with existing knowledge, to interpret the dataset:
Learn2014 <- read.table("Data/Learn2014", header = TRUE, sep = "\t")
Learn2014$gender <- factor(Learn2014$gender, levels=c("0","1"), labels=c(0,1))
str(Learn2014)
## 'data.frame': 166 obs. of 7 variables:
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ gender: Factor w/ 2 levels "0","1": 2 1 2 1 1 2 1 2 1 2 ...
## $ attit : num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ strat : num 3.38 2.75 3.62 3.12 3.62 ...
## $ Points: int 25 12 24 10 22 21 21 31 24 26 ...
The below graphs and summaries of the data help us gain an initial picture of the trends present therein. For one, we can see that the vast majority of students participating in this survey were female (110 females v. 56 males), with a mean age of 25.5 years and approx. 75% of students being 27 or younger.
As for the variables related to studying, all of them approximate a normal distribution, albeit with a slight skew to the right. Certain immediately interesting pieces of information arise from the correlation figures. Firstly, a positive attitude is strongly correlated with higher points, while the deep approach, counterintuitively, seems to have little effect on performance. Meanwhile, the surface approach seems to predict slightly worse performance, and the strategic approach slightly better performance. Curiously, age among men seems to predict worse performance, although this might be due to two outliers. We shall next test these initial findings with a multiple linear regression.
Graph_AgeGeN <- ggpairs(Learn2014, columns = c(1, 2), legend = 3, title = "Age and Gender", mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
Graph_AgeGenPoints <- ggpairs(Learn2014, columns = c(1,2,7), title = "Effects of Age and Gender on Points", mapping = aes(shape = gender, col = gender), lower = list(combo = wrap("facethist", bins = 20)))
Graph_PredPoints <- ggpairs(Learn2014, columns = c(3:7), title = "Attitude, Study Style and Points", mapping = aes(shape = gender, col = gender), lower = list(combo = wrap("facethist", bins = 20)))
Graph_AgeGeN
Graph_AgeGenPoints
Graph_PredPoints
summary(Learn2014)
## Age gender attit deep surf
## Min. :17.00 0: 56 Min. :1.400 Min. :1.583 Min. :1.583
## 1st Qu.:21.00 1:110 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.417
## Median :22.00 Median :3.200 Median :3.667 Median :2.833
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :2.787
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.167
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :4.333
## strat Points
## Min. :1.250 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:19.00
## Median :3.188 Median :23.00
## Mean :3.121 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:27.75
## Max. :5.000 Max. :33.00
For the below multiple linear regression, three predictor variables have been chosen: attitude, the surface approach, and the strategic approach. These variables were chosen due to their relatively higher correlations compared to the other available variables (age for males is excluded due to the presence of outliers skewing the calculation). The below multiple linear regression shows that only attitude has a statistically significant impact on points, as it is the only independent variable with a p-value below 0.05. In the case of attitude, there is a less than 0.1 percent chance of observing a relationship this strong if the null hypothesis (attitude has no effect on points) were true. Not only is attitude a statistically significant predictor of points, it also seems to have a strong impact, with its beta coefficient being approx. 3.4. This means that with each 1-point step towards a better attitude on the Likert scale, points rise by approximately 3.4.
For the remaining variables, the p-value is above 5 percent, the conventional cut-off for statistical significance. This interpretation is also supported by the t-values, which conventionally are expected to be larger than 2, or smaller than -2, to indicate statistical significance. Altogether, this model nevertheless only explains approximately 20% of the variation in points, meaning that it is not a very good predictive model.
Points_regression <- lm(Points ~ attit + strat + surf, data = Learn2014)
summary(Points_regression)
##
## Call:
## lm(formula = Points ~ attit + strat + surf, data = Learn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attit 3.3952 0.5741 5.913 1.93e-08 ***
## strat 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
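The coefficient interpretation above can be checked by hand: with the other predictors held fixed, a one-point increase in attitude changes the predicted score by exactly the attit coefficient. This is a small illustrative sketch; the numbers are copied from the regression summary above, and the helper function `pred` is my own, not part of the analysis script.

```r
# Coefficients copied from the summary of Points_regression above
coefs <- c(intercept = 11.0171, attit = 3.3952, strat = 0.8531, surf = -0.5861)

# Predicted points for given values of the three predictors
pred <- function(attit, strat, surf) {
  unname(coefs["intercept"] + coefs["attit"] * attit +
         coefs["strat"] * strat + coefs["surf"] * surf)
}

# Holding strat and surf fixed, one extra attitude point adds the attit
# coefficient (3.3952) to the prediction
pred(4, 3, 3) - pred(3, 3, 3)
```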
To further test the significance of attitude, the surface approach and strategic approach variables will be removed and a simple linear regression carried out with just attitude as the predictor variable. This, however, produced no novel results, and with the dropping of variables the explanatory power (multiple R-squared) of the model has gone down from approximately 0.21 to 0.19. This means that changes in students’ attitude can help explain 19% of the variation in students’ scores. The fact that the reduction is so minor is further indication of the minor impact of the surface approach and strategic approach variables. To play around a bit, I have also included a multiple linear regression with age added as a predictor. This, too, had no notable effect on the model; a slight rise in multiple R-squared is to be expected every time a predictor variable is added.
Points_Attit_Reg <- lm(Points ~ attit, data = Learn2014)
summary(Points_Attit_Reg)
##
## Call:
## lm(formula = Points ~ attit, data = Learn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attit 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
AgeAttit_Regression <- lm(Points ~ attit + Age, data = Learn2014)
summary(AgeAttit_Regression)
##
## Call:
## lm(formula = Points ~ attit + Age, data = Learn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3354 -3.3095 0.2625 4.0005 10.4911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.57244 2.24943 6.034 1.04e-08 ***
## attit 3.54392 0.56553 6.267 3.17e-09 ***
## Age -0.07813 0.05315 -1.470 0.144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.301 on 163 degrees of freedom
## Multiple R-squared: 0.2011, Adjusted R-squared: 0.1913
## F-statistic: 20.52 on 2 and 163 DF, p-value: 1.125e-08
To validate the model, this final section runs three diagnostic plots to test that the assumptions of a linear regression model are met by the data. For this validation, the simple linear regression model Points_Attit_Reg will be used, as it is the most efficient of the models produced. The below graphs, “Residuals vs. Fitted,” “Normal Q-Q,” and “Residuals vs. Leverage,” test whether the assumptions of normally distributed errors, constant variance of errors, and the absence of overly influential observations are met.
The Q-Q plot tests whether the errors are normally distributed. The below graph shows that the dots fit the line reasonably well, although towards the more extreme quantiles the distribution shows signs of being leptokurtic and as such might not be perfectly normal. Nevertheless, this analysis treats the distribution of errors as approximately normal.
The Residuals vs. Fitted graph tests the assumption of constant variance of errors by plotting residuals against predicted values. As we can see no discernible pattern in the data, we can interpret the graph as showing no indication of the size of the error depending on the predicted value. Thus, constant variance of errors is established.
The Residuals vs. Leverage graph shows us that none of the data points have an unreasonably high power to pull the model’s predictions towards themselves. This means that there are no influential outliers in the dataset. This, in combination with the above tests, indicates that the model is valid, as it adheres to the integral assumptions of linear regression.
plot(Points_Attit_Reg, which = c(1,2,5))
Libraries:
library(dplyr)
library(ggplot2)
library(GGally)
library(boot)
2. and 3.
Data Description with Variable Selection and Justification
The below glimpsed dataset, “TheData,” contains the questionnaire answers of 382 students from two Portuguese secondary schools. The answers were given by students attending maths and Portuguese language courses, each group having produced its own dataset; these have here been combined into one. In the process of combining the data, observations have been selected in a manner that ensures that 13 identifying variables contain no empty values. This has resulted in a reduction from 1044 to 382 observations per variable.
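The combination step described above can be sketched with dplyr’s `inner_join`, which keeps only the students whose identifying variables occur in both datasets. Since the actual files and the full set of 13 identifiers are not reproduced here, the toy data frames and the two-column identifier below are illustrative stand-ins only:

```r
library(dplyr)

# Toy stand-ins for the maths and Portuguese questionnaire datasets
math <- data.frame(school = c("GP", "GP", "MS"), sex = c("F", "M", "F"),
                   G3.math = c(10, 12, 14))
por  <- data.frame(school = c("GP", "MS", "MS"), sex = c("F", "F", "M"),
                   G3.por  = c(11, 13, 9))

# inner_join keeps only rows whose identifier combination appears in both;
# non-identifier columns from each source survive side by side
joined <- inner_join(math, por, by = c("school", "sex"))
nrow(joined)  # only (GP, F) and (MS, F) occur in both toy datasets
```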
The questionnaire was created to predict the target variable “G3,” i.e., the final grade of the student attending the course. Accordingly, the variables can be said to have at least a potential link to school performance, although some variables (such as whether the student lives in an urban or rural area) arguably have a more tenuous theoretical link to school performance than others (such as whether the student receives additional educational support). A glimpse of the data is provided below:
## Rows: 382
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F"…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U"…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "L…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T"…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servi…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other",…
## $ reason <chr> "course", "course", "other", "home", "home", "reputation",…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes"…
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother"…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1…
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes"…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "y…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5…
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 1…
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, …
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0…
## $ alcoholics <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
For the purposes of this analysis, four variables relating to alcohol use have been selected. The primary purpose of this analysis is to examine the effects of alcohol use on the final grade. Accordingly, the variable “G3,” the final grade on a 20-point scale, is a given, as is “alc_use,” the variable mapping alcohol use on a five-point scale where “1” indicates very low consumption and “5” very high consumption (this variable is the mean of the student’s alcohol consumption on weekdays and weekends, mapped by the variables “Dalc” and “Walc,” respectively). The hypothesized relationship between “G3” and “alc_use” is that higher consumption of alcohol predicts lower achievement in school, as represented by G3. Furthermore, it is hypothesized that the mechanism that might explain any potential causal relationship is the number of absences (measured in days out of 93), arising from either reduced energy or hangovers caused by higher amounts of drinking. (I realize that this is a bold assumption to make before examining the relationship between alcohol use, absences, and the final grade, but the assignment requires naming the four variables now.) The benefit of this causal explanation is that it does not require knowledge of the effects of alcohol use on the brain, nor does it demand that the high use be long term, a common qualifier with alcohol-related learning difficulties, but something for which the dataset contains no data.
As the working theory is that higher alcohol use has a negative effect on school performance, it is also useful to theorize about the reasons behind higher alcohol use. Here two variables are examined: “freetime,” i.e., how much free time the student has in a week on a five-point scale (1 denoting very little, 5 very much), and “famrel,” i.e., how good the student’s relationship with their family is on a five-point scale (1 denoting a very bad relationship, 5 an excellent relationship). The theorized relationships are as follows: the more free time one has, the more one drinks to pass the time; and the worse one’s relationship with one’s family, the more one drinks for comfort (the same caveat applies here as with the previous relationship). These are the relationships that will be explored below: A) The effects of alcohol use on the final grade; B) The effects of alcohol use on absences and the effects of absences on the grade; C) The effects of free time on alcohol use; D) The effects of family relations on alcohol use. Any further interesting relationships will be explored as warranted by the initial results (such as the effects of family relations, given a lot of free time, on alcohol use).
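Before moving on, the derived alcohol variables visible in the glimpse above can be sketched in code. The toy rows below are illustrative, but the formulas follow the definitions given in the text: alc_use is the mean of Dalc and Walc, and alcoholics flags an average use above 2.

```r
library(dplyr)

# Toy weekday (Dalc) and weekend (Walc) consumption values
toy <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 3, 5))

toy <- toy %>%
  mutate(alc_use    = (Dalc + Walc) / 2,  # mean of weekday and weekend use
         alcoholics = alc_use > 2)        # TRUE when average use exceeds 2

toy$alc_use     # 1.0 2.5 4.5
toy$alcoholics  # FALSE TRUE TRUE
```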
4.
Numerical and graphical exploration of relationships A through D.
A and B
The above set of graphs explores the relationships between alcohol use, absences, and the final grade. The results have been further divided by sex, in the spirit of last week. A few noteworthy points can be immediately noticed. Firstly, overall there seems to be no statistically significant relationship between the number of absences and the final grade. This, if anything, is a troubling result for Portuguese teachers. Admittedly, among males there seems to be a somewhat significant relationship. On the other hand, alcohol use would seem to predict both higher levels of absences and lower scores, although here, too, the difference between males and females is considerable.
Since there is no theoretical reason for this division, it raises some questions about the data. As such, before delving further into the numbers, we need to examine the data to see whether these variations between the sexes can be explained by abnormalities in the observations. Two observations immediately jump out from the data: in the column where absences are on the y-axis, we can note two observations, both female, that could constitute outliers. To examine this further, we will carry out a regression analysis with absences as the explanatory variable for the final score, and another with alcohol use as the explanatory variable for absences. Both analyses will then be subjected to the Residuals vs. Leverage test from last week, which will help indicate whether some of the data points have an unreasonably high power to pull the models’ predictions towards themselves.
##
## Call:
## lm(formula = G3 ~ absences, data = TheData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7235 -1.6055 0.3355 2.3355 6.8688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.72350 0.21879 53.583 <2e-16 ***
## absences -0.05897 0.03094 -1.906 0.0574 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.299 on 380 degrees of freedom
## Multiple R-squared: 0.009471, Adjusted R-squared: 0.006865
## F-statistic: 3.634 on 1 and 380 DF, p-value: 0.05738
##
## Call:
## lm(formula = absences ~ alc_use, data = TheData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.417 -3.442 -1.442 1.576 41.558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2523 0.5918 3.806 0.000165 ***
## alc_use 1.1901 0.2779 4.282 2.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.343 on 380 degrees of freedom
## Multiple R-squared: 0.04603, Adjusted R-squared: 0.04352
## F-statistic: 18.33 on 1 and 380 DF, p-value: 2.349e-05
It seems that the two data points have such high leverage as to bring their validity into question. Of course, in the absence of reasoned proof that they are invalid, they should be left in. In the interest of this exercise, I have nevertheless decided to apply the rule of thumb that observations with a Cook’s distance higher than 4/n (where n is the number of observations) can be removed. Let us see what the end result is after we apply this procedure to the data.
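The removal procedure can be sketched as follows. Since the course data is not reproducible here, the built-in mtcars dataset and a toy model (mpg ~ wt) stand in for TheData and G3 ~ absences, but the steps are the ones described above: fit, compute Cook’s distances, drop observations above 4/n, and refit.

```r
# Fit a simple model (mpg ~ wt stands in for G3 ~ absences)
model <- lm(mpg ~ wt, data = mtcars)

# Cook's distance measures each observation's influence on the fitted model
cd <- cooks.distance(model)

# Rule of thumb used here: drop observations with distance above 4/n
keep <- cd <= 4 / nrow(mtcars)
trimmed <- mtcars[keep, ]

# Refit on the trimmed data
model_2 <- lm(mpg ~ wt, data = trimmed)
nrow(mtcars) - nrow(trimmed)  # number of observations removed
```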
##
## Call:
## lm(formula = G3 ~ absences, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7846 -1.6377 0.2888 2.2888 6.9496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.78457 0.22435 52.529 <2e-16 ***
## absences -0.07342 0.03461 -2.121 0.0346 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.28 on 376 degrees of freedom
## Multiple R-squared: 0.01183, Adjusted R-squared: 0.009198
## F-statistic: 4.5 on 1 and 376 DF, p-value: 0.03455
##
## Call:
## lm(formula = absences ~ alc_use, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.885 -3.395 -1.397 1.603 41.595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4130 0.5349 4.511 8.63e-06 ***
## alc_use 0.9921 0.2533 3.917 0.000107 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.791 on 376 degrees of freedom
## Multiple R-squared: 0.0392, Adjusted R-squared: 0.03664
## F-statistic: 15.34 on 1 and 376 DF, p-value: 0.0001066
The new dimensions:
## [1] 378 35
With the removal of just four observations with a Cook’s distance higher than 4/n, as attested by the new dimensions, we can see that absences now function as a statistically significant predictor of academic performance. I would argue that, despite the absence of observation-specific reasons supporting the removal, the overall logical expectation that presence in class predicts performance, and the magnitude of change in the statistical significance of the results, warrant the removal of these values. As such, moving onward, this analysis relies on the modified dataset.
Finally, to test these results against the direct effect of alcohol use on the final grade, and the effect of absences given high alcohol use, we will conduct two more regression analyses:
##
## Call:
## lm(formula = G3 ~ alc_use, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4094 -1.8989 0.1011 2.1011 6.3459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3884 0.3645 33.984 < 2e-16 ***
## alc_use -0.4895 0.1726 -2.836 0.00482 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.265 on 376 degrees of freedom
## Multiple R-squared: 0.02094, Adjusted R-squared: 0.01833
## F-statistic: 8.041 on 1 and 376 DF, p-value: 0.004819
##
## Call:
## lm(formula = G3 ~ absences, data = filter(TheData_2, alc_use >
## 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8329 -0.7404 1.1671 1.8142 5.6294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.83286 0.93802 11.549 1.72e-13 ***
## absences -0.04622 0.11495 -0.402 0.69
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.606 on 35 degrees of freedom
## Multiple R-squared: 0.004599, Adjusted R-squared: -0.02384
## F-statistic: 0.1617 on 1 and 35 DF, p-value: 0.69
Looking at the results of the regression analyses, we can see that despite all the work that went into removing the high-Cook’s-distance observations, alcohol use on its own is still a stronger and statistically more significant predictor of poorer academic performance than the number of absences. We also see that the number of absences, given high alcohol use, provides nothing better in terms of predictive power than absences alone. As such, contrary to the originally proposed mechanism, while alcohol use is a statistically significant and strong predictor of absences (one step up the alcohol use scale corresponds to almost one full day of additional absences), absences themselves do not function as a strong predictor of poorer academic performance. In fact, absences only explain approximately half of the variation in final grade that is explained by alcohol use. We can as such conclude that there is good evidence that while alcohol use predicts poorer performance, it does not do so through absences.
C and D
As expected, both poor family relations and free time are statistically significant predictors of alcohol use. We can examine these in more detail with linear regression, as has been done below. We can see that both variables are statistically significant predictors of alcohol use. As for the hypothesized impact of poor family relations given lots of free time, it does not have an effect larger than poor family relations alone. In fact, given ample free time, poor family relations seem to have a smaller effect, but this difference is not statistically significant:
##
## Call:
## lm(formula = alc_use ~ freetime, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1822 -0.8364 -0.1822 0.5095 3.1636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.31755 0.16884 7.803 6.02e-14 ***
## freetime 0.17294 0.05015 3.449 0.000627 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9604 on 376 degrees of freedom
## Multiple R-squared: 0.03066, Adjusted R-squared: 0.02808
## F-statistic: 11.89 on 1 and 376 DF, p-value: 0.0006273
##
## Call:
## lm(formula = alc_use ~ famrel, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2930 -0.8469 -0.2576 0.4924 3.2778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.43567 0.22189 10.977 < 2e-16 ***
## famrel -0.14269 0.05497 -2.596 0.00981 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9668 on 376 degrees of freedom
## Multiple R-squared: 0.0176, Adjusted R-squared: 0.01499
## F-statistic: 6.738 on 1 and 376 DF, p-value: 0.009808
##
## Call:
## lm(formula = alc_use ~ famrel + freetime, data = TheData_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3850 -0.7974 -0.2116 0.5067 3.0067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.92585 0.25381 7.588 2.6e-13 ***
## famrel -0.17339 0.05452 -3.180 0.001594 **
## freetime 0.19586 0.05007 3.912 0.000109 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.949 on 375 degrees of freedom
## Multiple R-squared: 0.05612, Adjusted R-squared: 0.05108
## F-statistic: 11.15 on 2 and 375 DF, p-value: 1.982e-05
##
## Call:
## lm(formula = alc_use ~ famrel, data = filter(TheData_2, freetime >
## 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20823 -1.00998 -0.07606 0.85786 2.99002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.34040 0.45699 5.121 9.64e-07 ***
## famrel -0.06608 0.11041 -0.599 0.55
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 143 degrees of freedom
## Multiple R-squared: 0.002499, Adjusted R-squared: -0.004477
## F-statistic: 0.3582 on 1 and 143 DF, p-value: 0.5504
As such, we can conclude section 4 by summarizing that while alcohol use has a negative effect on academic performance, and poor family relations and free time increase alcohol consumption, all of these effects (while statistically significant) are only modest if we look at R-squared: alcohol use only explains approximately 2% of the variance in final grades, while poor family relations and free time, even taken together, only explain approximately 5% of the variance in alcohol consumption. As such, while we have some evidence of causal relationships, those relationships are not strong. We can additionally reject the hypothesis that the mechanism by which alcohol consumption affects grades is an increased number of absences.
5.
Logistic Regression of the above variables.
In the above analysis we have treated alcohol use either as an explanatory variable (A and B) or as a target variable (C and D) in a linear model. Here, alcohol use will instead be defined as a binomial variable: individuals with an alc_use above 2 (“low”), i.e., 2.5 or higher, will belong to the category “alcoholics,” while those at 2 or below will not. Modelling the other above variables within this framework requires logistic regression, which calculates the probability of an individual belonging to a category (here, alcoholics) based on the model inputs. A probability higher than 0.5 will indicate belonging to the group.
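The probability mentioned above comes from the inverse-logit transform of the model’s linear predictor, and the 0.5 cut-off corresponds to log-odds of zero. A small self-contained illustration (the helper `inv_logit` is my own name for the standard transform):

```r
# Inverse logit: maps log-odds to probabilities in (0, 1)
inv_logit <- function(x) 1 / (1 + exp(-x))

inv_logit(0)    # 0.5: the decision boundary (log-odds of zero)
inv_logit(2)    # ~0.88: classified as belonging to the group
inv_logit(-2)   # ~0.12: classified as not belonging
```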
We will employ all the other variables used above, including absences, since it did have a statistically significant relationship with alcohol use. Consequently, we get the following logistic regression:
##
## Call:
## glm(formula = alcoholics ~ famrel + freetime + absences + G3,
## family = "binomial", data = TheData_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6726 -0.8342 -0.6456 1.1785 2.0220
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.56725 0.73233 -0.775 0.438585
## famrel -0.31613 0.12903 -2.450 0.014280 *
## freetime 0.41441 0.12521 3.310 0.000934 ***
## absences 0.07175 0.02371 3.027 0.002473 **
## G3 -0.06959 0.03593 -1.937 0.052781 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 455.91 on 377 degrees of freedom
## Residual deviance: 424.42 on 373 degrees of freedom
## AIC: 434.42
##
## Number of Fisher Scoring iterations: 4
The Odds Ratios and Their Confidence Intervals
## OR 2.5 % 97.5 %
## (Intercept) 0.5670801 0.1325241 2.3671826
## famrel 0.7289627 0.5649509 0.9387214
## freetime 1.5134784 1.1886426 1.9440711
## absences 1.0743848 1.0268123 1.1277893
## G3 0.9327736 0.8689267 1.0008341
In the above summary we can see that the variables used span a wide range of statistical significance. As the commonly accepted cut-off for statistical significance is a p-value of less than 0.05, alongside absolute z-values higher than 2 and 95% confidence intervals for the odds ratio that do not include 1, we can conclude that the variable G3, the final grade, is not a statistically significant predictor in our model. As such, we can drop it going forward. (A confidence interval crossing 1 means that an odds ratio of 1, i.e., no relationship between the predictor and the target variable, cannot be ruled out.)
The odds ratios support our initial hypotheses. Since odds ratios higher than 1 indicate that the variable is positively associated with the observation belonging to the group (in this case, alcoholics), both free time and absences positively predict belonging to the alcoholics group.
Since better family relations negatively predict belonging to the alcoholics group, we can conclude that the hypothesized positive impact of bad family relations holds.
Nevertheless, as stated above, the impact of these variables is minor, with odds ratios falling close to 1.
(As the final grade is not statistically significant, it has been ignored.)
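The odds-ratio table above can be produced along the following lines. Since the course data is not reproduced here, a toy logistic model on the built-in mtcars data (transmission type modelled from car weight) stands in for the alcoholics model, but the exponentiate-coefficients-and-intervals pattern is the same:

```r
library(MASS)  # provides profile-likelihood confint for glm objects

# Toy stand-in: transmission type (am, 0/1) modelled from car weight (wt)
m <- glm(am ~ wt, family = "binomial", data = mtcars)

OR <- exp(coef(m))     # odds ratios
CI <- exp(confint(m))  # 95% confidence intervals on the odds-ratio scale
cbind(OR, CI)
```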
6.
The below numerical and graphical explorations detail the accuracy of the model without the variable G3. While the plot would seem to indicate a rather random sorting of predictions, a closer examination, carried out by tabulating predictions against the data, showcases a more nuanced model. The model clearly over-predicts non-alcoholics, and when it does predict an alcoholic, there is only (approximately) a 50/50 chance of that prediction being right; but since the majority of cases are non-alcoholics, the model’s training error is “only” 0.29, meaning that 29% of the predictions are incorrect. This is better than mere random guessing, or flipping a coin, especially since alcoholics and non-alcoholics are not split 50/50. Nevertheless, we can see from both the graph and the confusion matrix that the model misses many, many cases where the individual does belong to the group “alcoholics.” As such, it is not a good model.
##
## Call:
## glm(formula = alcoholics ~ famrel + freetime + absences, family = "binomial",
## data = TheData_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7039 -0.8148 -0.6661 1.2152 1.9476
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.33999 0.61829 -2.167 0.030215 *
## famrel -0.32602 0.12889 -2.529 0.011424 *
## freetime 0.41713 0.12437 3.354 0.000797 ***
## absences 0.07582 0.02361 3.211 0.001323 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 455.91 on 377 degrees of freedom
## Residual deviance: 428.17 on 374 degrees of freedom
## AIC: 436.17
##
## Number of Fisher Scoring iterations: 4
## prediction
## alcoholics FALSE TRUE
## FALSE 255 13
## TRUE 98 12
## [1] 0.2936508
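The training error above is the share of predictions that land on the wrong side of 0.5. A minimal sketch of such a loss function (names hypothetical; `class` holds the observed 0/1 labels, `prob` the predicted probabilities):

```r
# Proportion of incorrect predictions: a prediction counts as wrong when
# the predicted probability is on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = c(0, 1, 0), prob = c(0.2, 0.9, 0.8))  # one of three wrong
```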
BONUS
By using ten-fold cross-validation, we can repeatedly train the model on nine-tenths of “TheData” and then check its accuracy (defined by the ratio of incorrect predictions, as above) against the remaining one-tenth, averaging the error over the ten folds. This is done below. The resulting error of about 0.3 indicates that the model performs on unseen data roughly as it does on the training data. It is, however, worse than the one introduced in the DataCamp exercises.
## [1] 0.3042328
I was able to find a better model after having played around with the above variables in combination with sex, failures and goout. It has an error rate of 0.24 in ten-fold cross-validation:
## [1] 0.2433862
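The cross-validation itself can be run with `cv.glm` from the boot package. Since the course data is not reproduced here, the sketch below uses the built-in mtcars data as a stand-in (the model and variable choices are illustrative only, not the actual course model):

```r
library(boot)  # provides cv.glm()

# Loss as above: share of predictions on the wrong side of 0.5
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# Illustrative logistic model on built-in data
m <- glm(am ~ mpg + wt, family = "binomial", data = mtcars)

set.seed(42)
cv <- cv.glm(data = mtcars, glmfit = m, cost = loss_func, K = 10)
cv$delta[1]  # average prediction error over the ten folds
```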
THE END!
The dataset “Boston” used in this analysis can be downloaded with the “MASS” package. As such, it can be seen as a training dataset of sorts. It contains 14 variables with a (potential) connection to housing values in the suburbs of Boston. These variables are:
| Variable | Explanation |
|---|---|
| “crim” | per capita crime rate by town. |
| “zn” | proportion of residential land zoned for lots over 25,000 sq.ft. |
| “indus” | proportion of non-retail business acres per town. |
| “chas” | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| “nox” | nitrogen oxides concentration (parts per 10 million). |
| “rm” | average number of rooms per dwelling. |
| “age” | proportion of owner-occupied units built prior to 1940. |
| “dis” | weighted mean of distances to five Boston employment centres. |
| “rad” | index of accessibility to radial highways. |
| “tax” | full-value property-tax rate per $10,000. |
| “ptratio” | pupil-teacher ratio by town. |
| “black” | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| “lstat” | lower status of the population (percent). |
| “medv” | median value of owner-occupied homes in $1000s. |
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Each variable has 506 observations/data points, and the below computational analysis indicates that the data contains no empty cells and that each of the 14×506 = 7084 values is numeric/integer. Some observations are ratios, some percentages, and at least one (chas) is a dummy variable coded 0/1.
# Count empty (NA) cells by checking every value in the data frame
Empty <- 0
for (x in unlist(Boston)){
  if (is.na(x)){
    Empty <- Empty+1
  }
}
# Count numeric values the same way
Numer <- 0
for (x in unlist(Boston)){
  if (is.numeric(x)){
    Numer <- Numer+1
  }
}
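The same checks can be written more compactly with vectorized helpers (assuming the MASS package, which ships the Boston data, is available):

```r
library(MASS)  # provides the Boston data

sum(is.na(Boston))               # total number of missing cells: 0
all(sapply(Boston, is.numeric))  # TRUE: every column is numeric/integer
```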
Below, the reader can find the simple bar graphs of each variable, as well as a summary of each examined variable:
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
As the above general overview indicates, the variables take values across widely different ranges - as one would expect from a dataset combining different kinds of measures. Commenting on every distribution seems pointless at first sight, but most of the graphs indicate some interesting things.
To begin with, the age-graph indicates an aging city. More importantly, its automatic scale seems off. Either there is a high number of areas in the city where the proportion of buildings built before 1940 is close to 100, or the dataset has a typo. Or, as a final thought that seems most likely: many of the properties surveyed for this dataset come from the same Boston town/area and hence share exactly the same values for some area-specific variables.
The black-graph appears empty. Yet again, a closer examination (carried out below) shows that the data takes some interesting values. I do not have the knowledge to say what the measure “1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town” should indicate, but approximately 120 of the properties get a value close to 397, while other values occur only once. In fact, the summary statistics presented above suggest that this value is probably 396.90. The variable is also curious in that most of its observations fall within the range of 370 to 400, while its smallest observation is 0.32. It might be that the small observation has not been multiplied by 1000 as per the formula, since the result would then come close to the expected range. Then again, this can be a sign of a typo, or of something else going on. And as above, the repetition of one value might be explained by many of the properties surveyed for this dataset coming from the same Boston town/area and hence sharing exactly the same values for some area-specific variables.
The crim-graph also looks empty, but again, the more detailed look below shows that most values range from 0 to 1, as one would expect from a per-capita rate. The fact that the general graph above has a range of 0 to 75 would indicate a typo or some placeholder value. Perhaps certain observations have been multiplied by one hundred to give a percentage, or the person recording the observation has forgotten a decimal point. This would seem to be the case based on the summary statistics, since the max values lie far above the median and mean.
The dis-graph looks empty as well, but the closer look below shows the granular level of the observations: with no aggregation, the single bars disappear when the graph is extended to contain all values. The summary statistic indicates nothing out of the ordinary.
The final empty-appearing graph, indus, seems to suffer from the same issue as the black-graph. As one can see below, most value-counts range between 0 and 10, but around the 17-mark there seems to be one value with a high count, approx. 120 observations. The same issue can also be observed in the tax-, rad-, and ptratio-graphs, although no detailed look is carried out below. This is further evidence that many of the properties surveyed for this dataset might come from the same Boston town/area and hence share exactly the same values for some area-specific variables.
Finally, the zn-graph looks odd as well, but I would argue that this is just the result of strict zoning laws limiting the number of large lots in most areas (observe the large count at the value “0”).
As for the relationships between the variables, the below matrix shows the correlation of each variable paired with each of the others. Of note is the fact that the matrix seemingly indicates that each variable has some statistically significant relationship with each of the other variables. The only exception, the chas-variable, is also the only dummy variable. An interesting question in this regard is why chas is the only variable without statistically significant correlations with most other variables. The first answer is simple: whether a tract bounds the river has no statistically significant impact on many of the other variables. The second option comes down to the inner workings of R - it might be that the cor.mtest-function used here to map p-values does not handle dummy variables well. No mention of this possibility is given by the ?cor.mtest-command.
On the other hand, it is perhaps not surprising that variables that are expected to be significant predictors of housing prices also have statistically significant correlations with one another. Out of these correlations, a few should be highlighted in preparation for the coming phases. The variable “crim” (per-capita crime rate by town) seems to have a strong, statistically significant positive correlation with high property-tax properties, as well as with properties with easier access to radial highways. Property crime is a good explanatory factor for these correlations - high tax- and rad-values indicate high-value targets (former) and/or easy getaway and access options (latter).
Higher levels of industry (indus), house age (age), air pollution (nox), pupil-to-teacher ratio (ptratio), and the population’s lower status all have a weaker positive correlation with the crime rate. This perhaps indicates a second category of neighborhoods compared to the above: older, impoverished industrial areas with less access to good education.
Both higher median value and longer distance from employment centers correlate weakly and negatively with the crime rate. I have a hard time explaining this. Perhaps it is due to the existence of middle-class suburbs, which are not attractive to property theft due to their distance from a poorer city center? This conclusion is perhaps supported by the strong negative correlation between the dis-variable on one hand and the indus-, nox-, and age-variables on the other, which would seem to indicate that the (employment) centers of the city are older industrial neighborhoods. All of this is of course anecdotal in the absence of clearer information.
Finally, the black-variable seems to be weakly and negatively correlated with the crime rate, but as I do not understand the calculations behind the variable, it is rather hard to interpret the (potential) meaning of the correlation - as such I will drop it going forward.
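A sketch of how such a correlation matrix with significance marks can be produced with the corrplot package (the exact plotting options used in this diary are assumptions):

```r
library(MASS)      # Boston data
library(corrplot)  # corrplot() and cor.mtest()

cor_m <- cor(Boston)        # pairwise correlations
p_m   <- cor.mtest(Boston)  # p-values for each pair of variables

# Crosses out correlations that are not significant at the 0.05 level
corrplot(cor_m, p.mat = p_m$p, sig.level = 0.05, method = "circle")
```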
Below the reader can find a summary of the standardized Boston variables. All of them now share a mean of zero, which is by definition a feature of a standardized variable. They are also all on the same scale, which means that they can be compared to one another more easily - although that is not immediately clear from the data, since the value distributions still retain their curious aspects: for example, with the variable “black,” the minimum still lies far to the left of the rest of the data. Additionally, the standardized binomial variable “chas” has arguably become nonsensical: the old value of 0 has been replaced by -0.2723 and the old value of 1 by 3.6648.
It should also be noted that none of the variables can be fully standardized into standard normal distributions, since they do not adhere to a normal distribution to begin with. This is, at least in part, probably due to the (theorized) over-representation of one neighborhood in the dataset.
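The standardization itself subtracts each column’s mean and divides by its standard deviation; in R this is one call to `scale()`:

```r
library(MASS)  # Boston data

# scale() returns (x - mean(x)) / sd(x) for every column
boston_scaled <- as.data.frame(scale(Boston))

colMeans(boston_scaled)  # all (numerically) zero after centering
```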
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
The reader can also note that in this second set, the crim-variable has been replaced by the categorical Crime-variable, as per instructions, and the chas-variable has been returned to its original binomial state. Even further down, the reader can finally find the test set with the Crime-variable removed, after the correct answers have been saved.
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
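A sketch of how the categorical Crime-variable can be built from the standardized crim values, using its quantiles (shown above) as break points (variable names assumed):

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))

# Cut the standardized crime rate at its quartiles into four categories
bins <- quantile(boston_scaled$crim)
Crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("Lowest", "Lower", "Higher", "Highest"))
table(Crime)  # four roughly equal-sized groups
```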
## zn indus nox rm
## Min. :-0.48724 Min. :-1.5563 Min. :-1.4644 Min. :-3.8764
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.9121 1st Qu.:-0.5681
## Median :-0.48724 Median :-0.2109 Median :-0.1441 Median :-0.1084
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.: 0.5981 3rd Qu.: 0.4823
## Max. : 3.80047 Max. : 2.4202 Max. : 2.7296 Max. : 3.5515
## age dis rad tax
## Min. :-2.3331 Min. :-1.2658 Min. :-0.9819 Min. :-1.3127
## 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668
## Median : 0.3171 Median :-0.2790 Median :-0.5225 Median :-0.4642
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294
## Max. : 1.1164 Max. : 3.9566 Max. : 1.6596 Max. : 1.7964
## ptratio black lstat medv
## Min. :-2.7047 Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.2746 Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 1.6372 Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
## Crime Boston.chas
## Lowest :127 Min. :0.00000
## Lower :126 1st Qu.:0.00000
## Higher :126 Median :0.00000
## Highest:127 Mean :0.06917
## 3rd Qu.:0.00000
## Max. :1.00000
## zn indus nox rm
## Min. :-0.48724 Min. :-1.42220 Min. :-1.46443 Min. :-3.8764
## 1st Qu.:-0.48724 1st Qu.:-0.74002 1st Qu.:-0.77189 1st Qu.:-0.7769
## Median :-0.48724 Median : 0.01796 Median : 0.04578 Median :-0.2279
## Mean :-0.09483 Mean : 0.18062 Mean : 0.14941 Mean :-0.1578
## 3rd Qu.:-0.48724 3rd Qu.: 1.01499 3rd Qu.: 0.87424 3rd Qu.: 0.3033
## Max. : 3.58609 Max. : 2.42017 Max. : 2.72965 Max. : 3.0078
## age dis rad tax
## Min. :-2.2017 Min. :-1.2623 Min. :-0.98187 Min. :-1.30676
## 1st Qu.:-0.3810 1st Qu.:-0.8736 1st Qu.:-0.63733 1st Qu.:-0.76385
## Median : 0.5284 Median :-0.5047 Median :-0.52248 Median :-0.03107
## Mean : 0.1834 Mean :-0.1425 Mean : 0.04499 Mean : 0.11825
## 3rd Qu.: 0.9707 3rd Qu.: 0.4605 3rd Qu.: 1.65960 3rd Qu.: 1.52941
## Max. : 1.1164 Max. : 2.5777 Max. : 1.65960 Max. : 1.79642
## ptratio black lstat medv
## Min. :-2.5199404 Min. :-3.52291 Min. :-1.36857 Min. :-1.90634
## 1st Qu.:-0.7185093 1st Qu.: 0.18324 1st Qu.:-0.67470 1st Qu.:-0.72934
## Median : 0.3207778 Median : 0.37670 Median : 0.05138 Median :-0.25908
## Mean :-0.0002918 Mean : 0.06868 Mean : 0.20566 Mean :-0.05186
## 3rd Qu.: 0.8057784 3rd Qu.: 0.43832 3rd Qu.: 0.76346 3rd Qu.: 0.26826
## Max. : 1.2676838 Max. : 0.44062 Max. : 3.54526 Max. : 2.98650
## Boston.chas
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.06863
## 3rd Qu.:0.00000
## Max. :1.00000
Despite the fact that none of the variables adhere to the assumption of normal distribution required by LDA, nor is the chas-variable continuous as is usually expected, below the reader can find the required LDA (bi)plot. It contains the categorical crime rate as the target variable and all of the remaining variables as predictors (even the black-variable, despite what I said earlier about not using it). Observing both the biplot and the LDA output, we can see that LD1 explains 96 percent of the between-group variance, while LD2 explains three percent and LD3 one percent.
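A minimal sketch of the pipeline behind the output below - building the Crime factor, splitting off a training set, and fitting the LDA. The 80% split and the set names are assumptions, and the re-attachment of the original chas values is omitted here:

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$Crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("Lowest", "Lower", "Higher", "Highest"))
boston_scaled$crim <- NULL  # drop the continuous original

# Random 80/20 split into training and test sets
set.seed(1)
ind <- sample(nrow(boston_scaled), size = round(nrow(boston_scaled) * 0.8))
TrainSet <- boston_scaled[ind, ]
TestSet  <- boston_scaled[-ind, ]

lda.fit <- lda(Crime ~ ., data = TrainSet)
plot(lda.fit)  # the LDA (bi)plot
```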
## Call:
## lda(Crime ~ ., data = TrainSet)
##
## Prior probabilities of groups:
## Lowest Lower Higher Highest
## 0.2623762 0.2500000 0.2450495 0.2425743
##
## Group means:
## zn indus nox rm age dis
## Lowest 0.94873938 -0.89313923 -0.8545752 0.42801907 -0.8755807 0.8451251
## Lower -0.06165277 -0.31403667 -0.5964211 -0.08767888 -0.3996310 0.3987432
## Higher -0.37290116 0.08583784 0.3417213 0.12938920 0.3487895 -0.3356536
## Highest -0.48724019 1.01499462 1.0382944 -0.33907005 0.8156924 -0.8377148
## rad tax ptratio black lstat medv
## Lowest -0.6752522 -0.7315526 -0.42960060 0.38432721 -0.77989661 0.50243720
## Lower -0.5361295 -0.4971113 -0.04760321 0.31681294 -0.19061341 0.01925548
## Higher -0.4192384 -0.3453652 -0.28879737 0.09365932 -0.01323016 0.17973617
## Highest 1.6596029 1.5294129 0.80577843 -0.90831104 0.83932500 -0.69089518
## Boston.chas
## Lowest 0.03773585
## Lower 0.06930693
## Higher 0.12121212
## Highest 0.05102041
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08866773 0.718788171 -0.94589532
## indus 0.05331710 -0.132123462 0.44938777
## nox 0.17161192 -0.906743857 -1.38261185
## rm -0.11183753 -0.098954475 -0.09108170
## age 0.26550601 -0.291632819 -0.14216674
## dis -0.09554650 -0.350826244 0.26710102
## rad 3.62497052 0.984486539 -0.03752939
## tax 0.14286346 -0.098355568 0.44636102
## ptratio 0.11231641 0.040946150 -0.24712381
## black -0.12587821 0.002389311 0.11140499
## lstat 0.17795303 -0.298690391 0.44507952
## medv 0.18192519 -0.448381368 -0.16503517
## Boston.chas -0.38750819 -0.261604065 0.38667135
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9601 0.0295 0.0104
The table below showcases the cross-tabulated predictions against the actual categories. We can see that the model rather accurately predicts the group membership of properties in the higher and highest crime rate areas, while it struggles a bit more in the lower and lowest categories. Indeed, the model seems to slightly over-predict higher crime rates, especially when provided with data on a property in a lower/lowest crime rate area. Nevertheless, it outperforms simple guessing, under which a property would have a roughly equal 25% chance of belonging to any of the four categories, as indicated by the prior probabilities in the previous section’s model output. A random division of the properties into four equal-sized groups would result, on average, in three incorrect predictions per correct prediction. Such odds are much worse than even the model’s odds of correctly predicting a property in the lowest crime rate area.
## predicted
## correct Lowest Lower Higher Highest
## Lowest 14 7 0 0
## Lower 4 16 5 0
## Higher 0 8 17 2
## Highest 0 0 1 28
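From the cross-tabulation above we can also compute an overall accuracy; reproducing the table as a matrix:

```r
# The confusion matrix from above (rows = correct, columns = predicted)
conf <- matrix(c(14,  7,  0,  0,
                  4, 16,  5,  0,
                  0,  8, 17,  2,
                  0,  0,  1, 28),
               nrow = 4, byrow = TRUE)

sum(diag(conf)) / sum(conf)  # share of correct predictions, about 0.74
```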
The below two graphs showcase the results of the final K-means analysis. The first graph details the change in the total within-cluster sum of squares (WCSS) as we increase the number of clusters from 1 to 14. The aim is to use the graph to find the optimal number of clusters. A more granular level will always lead to a smaller WCSS without necessarily being a better grouping device (consider that the smallest WCSS comes from having only one observation in each “cluster,” meaning that no clustering has been done at all). We therefore need to find a point where the WCSS drops drastically - a number of clusters that is significantly more precise than a smaller number, but not significantly less precise than a larger one. The first graph indicates that that point is two (2) clusters.
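The elbow search can be sketched in R as follows (assuming the standardized data is in `boston_scaled`, rebuilt here so the snippet stands alone):

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)
k_max <- 14
# Total within-cluster sum of squares for k = 1 .. 14
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, k)$tot.withinss)
plot(1:k_max, twcss, type = "b")  # look for the sharp drop ("elbow")

km <- kmeans(boston_scaled, centers = 2)  # final two-cluster solution
```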
As for the pairs plot produced by clustering pairs of variables into two clusters, we will only discuss the top row/first column of the graphs, which relate to the crime rate. This is done to limit the discussion to the relevant aspects rather than covering each of the 182 panels. What we need to keep in mind is that a K-means analysis clustering into two groups attempts to find two sets of observations such that the total (in this case Euclidean) distances to the group means are the smallest. If we had a single group sharing many of its observation values, we would expect that group to repeat itself in each graph. And indeed, we see most of the crime-graphs maintain a very similar, flat/narrow red-group structure throughout the pairings. To me, this is further evidence that the roughly 120 uncommonly consistently-valued observations that section 2 identified in multiple variables come from a single group of properties from the same area. Perhaps the data showcases something else as well, but hopefully this will suffice. This is already a long text.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.2663 4.6116 4.7275 5.9572 13.8843
This week’s data analysis mostly deals with the “human” dataset. The original data contained a number of variables mapping human development and gender equality in 195 countries; it has been cut down to the following eight variables:
+ Life.Expectancy refers to a citizen’s average life expectancy at birth
+ Education.Years is the expected number of years a citizen will spend in education
+ GNI.per.Capita is the Gross National Income adjusted for population
+ Maternal.Mortality is a ratio of deaths per 100,000 births
+ Teen.Birth is the number of births per 1,000 women ages 15 to 19.
+ Parliamentary.Participation is the ratio of women to men in parliament
+ Female.Secondary.Education is the percentage of females that attend secondary education
+ Female.Work is the percentage of females that participate in the labour force
Below are simple bar plots graphically showcasing the variables used this week. As the graphs show, most of these variables do not follow a Gaussian distribution, i.e. they are not bell-curves. The only exception is the variable mapping the expected years a citizen will spend in school.
Even further down, the reader can find the correlation plot, where "*" indicates statistical significance to the 0.05-level. As can be seen, three sets of variables emerge from the data:
+ Female.Work and Parliamentary.Participation, which are not strongly correlated with anything
+ Maternal.Mortality and Teen.Birth, which are strongly positively correlated only with one another.
+ The remaining variables on education, GNI, and life-expectancy, which are (semi) strongly positively correlated with each other and strongly negatively correlated with the preceding set of variables.
## Female.Secondary.Education Female.Work Life.Expectancy Education.Years
## Min. : 0.90 Min. :13.50 Min. :49.00 Min. : 5.40
## 1st Qu.: 27.15 1st Qu.:44.45 1st Qu.:66.30 1st Qu.:11.25
## Median : 56.60 Median :53.50 Median :74.20 Median :13.50
## Mean : 55.37 Mean :52.52 Mean :71.65 Mean :13.18
## 3rd Qu.: 85.15 3rd Qu.:61.90 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :100.00 Max. :88.10 Max. :83.50 Max. :20.20
## GNI.per.Capita Maternal.Mortality Teen.Birth
## Min. : 581 Min. : 1.0 Min. : 0.60
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65
## Median : 12040 Median : 49.0 Median : 33.60
## Mean : 17628 Mean : 149.1 Mean : 47.16
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95
## Max. :123124 Max. :1100.0 Max. :204.80
## Parliamentary.Participation
## Min. : 0.00
## 1st Qu.:12.40
## Median :19.30
## Mean :20.91
## 3rd Qu.:27.95
## Max. :57.50
Below, Principal Component Analysis (PCA), relying on Singular Value Decomposition, is applied to the non-standardized “human” data explored above. This analysis is expected to turn out skewed, as PCA relies on variance to identify the principal components; consequently, variables measured on larger scales will have an unreasonably high impact on the analysis.
## Standard deviations (1, .., p=8):
## [1] 18544.172057 186.283543 25.972416 20.074641 14.321634
## [6] 10.634338 3.721257 1.428190
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3
## Female.Secondary.Education -9.317458e-04 0.085169068 -0.3652346210
## Female.Work 9.124960e-05 -0.031442122 -0.0180958993
## Life.Expectancy -2.815825e-04 0.028224407 -0.0170864406
## Education.Years -9.562916e-05 0.007556395 -0.0210421918
## GNI.per.Capita -9.999828e-01 -0.005830843 0.0006412388
## Maternal.Mortality 5.655739e-03 -0.987500960 -0.1481355205
## Teen.Birth 1.233962e-03 -0.125305410 0.9184673154
## Parliamentary.Participation -5.526462e-05 0.003211742 -0.0038388487
## PC4 PC5 PC6
## Female.Secondary.Education -0.8435797499 -3.770012e-01 -6.083913e-02
## Female.Work -0.3928157700 8.233860e-01 4.077784e-01
## Life.Expectancy -0.0212996883 1.136966e-02 -6.669649e-02
## Education.Years -0.0308785262 3.896982e-03 -2.437473e-02
## GNI.per.Capita 0.0002381021 6.651486e-06 2.062449e-05
## Maternal.Mortality -0.0173448186 -3.955532e-02 -2.010447e-02
## Teen.Birth -0.3475453954 -1.381061e-01 -2.499459e-02
## Parliamentary.Participation -0.1075784134 3.989026e-01 -9.077136e-01
## PC7 PC8
## Female.Secondary.Education 0.0303039622 3.118655e-02
## Female.Work -0.0093667016 6.654689e-03
## Life.Expectancy -0.9901494274 1.161211e-01
## Education.Years -0.1134676761 -9.925031e-01
## GNI.per.Capita 0.0001028639 1.871381e-05
## Maternal.Mortality -0.0244043639 -7.101321e-04
## Teen.Birth -0.0127951595 -8.079546e-03
## Parliamentary.Participation 0.0704544742 1.925677e-02
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 186.2835 25.97 20.07 14.32 10.63 3.721 1.428
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.00 0.00 0.000 0.000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.00 1.00 1.000 1.000
As expected, the graph can tell us very little, and PC1 purports to explain (essentially) all of the variation. Let us standardize the data and try again.
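The scale-dominance effect can be reproduced with any data in which one column has a much larger variance than the rest; a synthetic two-variable example (not the actual human data):

```r
set.seed(1)
toy <- data.frame(gni  = rnorm(100, mean = 20000, sd = 18000),
                  rate = rnorm(100, mean = 50,    sd = 25))

pca_raw <- prcomp(toy)                 # unscaled: PC1 is essentially 'gni'
pca_std <- prcomp(toy, scale. = TRUE)  # scaled: variance is shared

summary(pca_raw)$importance["Proportion of Variance", "PC1"]  # near 1
summary(pca_std)$importance["Proportion of Variance", "PC1"]  # near 0.5
```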
By standardizing the variables in the “Human”-data, the PCA is able to work on variables with comparable variances. Accordingly, variables on scales where high values are expected will not have a disproportionately high impact on the calculations. As can be seen below, the biplot produced by the standardized data can offer valuable insights into the data.
## Standard deviations (1, .., p=8):
## [1] 2.1359720 1.1170602 0.8767066 0.7202736 0.5530904 0.5299143 0.4493522
## [8] 0.3372776
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## Female.Secondary.Education -0.39874523 0.101737848 -0.210481695 -0.31033900
## Female.Work 0.14439668 0.682158952 -0.575062704 -0.27300332
## Life.Expectancy -0.43080653 0.003997077 0.073302641 -0.02783772
## Education.Years -0.41858363 0.138149663 -0.073869337 -0.07719277
## GNI.per.Capita -0.33914119 0.098797891 -0.338769060 0.84020987
## Maternal.Mortality 0.41991628 0.123208094 -0.145471957 0.28520785
## Teen.Birth 0.40271826 0.088902996 -0.003213286 0.11830579
## Parliamentary.Participation -0.07626861 0.687286162 0.691544283 0.14537131
## PC5 PC6 PC7 PC8
## Female.Secondary.Education 0.43437742 -0.50478735 0.48033587 0.12578655
## Female.Work -0.22413576 0.23657515 0.01076780 -0.04754207
## Life.Expectancy 0.04920168 0.57970641 0.04122211 0.68415054
## Education.Years 0.30831448 -0.11017948 -0.81713841 -0.13919229
## GNI.per.Capita -0.01249472 0.01564322 0.16310711 -0.16583233
## Maternal.Mortality 0.05493343 -0.43780233 -0.24286926 0.67254029
## Teen.Birth 0.81248359 0.36928738 0.07464930 -0.11761127
## Parliamentary.Participation -0.01724555 -0.11284562 0.09265604 -0.02894539
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1360 1.1171 0.87671 0.72027 0.55309 0.5299 0.44935
## Proportion of Variance 0.5703 0.1560 0.09608 0.06485 0.03824 0.0351 0.02524
## Cumulative Proportion 0.5703 0.7263 0.82235 0.88720 0.92544 0.9605 0.98578
## PC8
## Standard deviation 0.33728
## Proportion of Variance 0.01422
## Cumulative Proportion 1.00000
As the caption suggests, the data seems to indicate that while the participation of women in the labour force and in political decision-making are important components of gender equality, they do not predict the overall (physical) well-being of women. More on this below. The reason our standardized data provides more insight lies in the standardized means and variances it contains. PCA creates its first principal component (PC) by finding the direction through the observations that accounts for most of the variance (while keeping the sum of squares of the resulting coefficients at 1). The second PC is calculated similarly from the remaining variance, but in such a way that it is uncorrelated with the first PC. With standardized data, the variances are not affected by the scales or units of the variables and are as such comparable - no single variable can account for most of the variance simply due to the large values it takes. Conversely, with the non-standardized data, the variable measuring the per-capita GNI of the countries had the largest explanatory power, since its values range in the tens of thousands. The rest of the variables take much lower values (the second-highest absolute values belong to maternal mortality, and even that has a range of about 1 to 1100), and consequently they seemingly account for less of the variation in the data. This is why PC1 correlates almost 100% with the variable GNI.per.Capita in the biplot run on non-standardized variables, and why that variable seems to have a disproportionately high variance in the dataset.
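The sum-of-squares constraint on the coefficients can be verified directly from the PC1 loadings printed above:

```r
# PC1 loadings of the standardized PCA, copied from the rotation matrix above
pc1 <- c(-0.3987, 0.1444, -0.4308, -0.4186, -0.3391, 0.4199, 0.4027, -0.0763)

sum(pc1^2)  # approximately 1, as required of a unit-length loading vector
```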
As indicated above, high labour-force and political participation of women is not shorthand for a high level of well-being among women. We can see that the data is principally divided between countries with high levels of maternal mortality and adolescent births, and countries with high levels of life expectancy, education and GNI. We might say that this principal component mostly maps reproductive health. Indeed, the former two variables are highly correlated with each other, which is not a surprise: giving birth at a young (teen) age carries a greater risk of death in childbirth, as the mother’s body is not yet fully developed. The latter four variables, on the other hand, predict lower levels of maternal mortality and adolescent birth - partly because higher levels of education are known to predict lower fertility rates and later childbearing among women, and partly because higher GNI predicts better health care - both contributing to a higher life expectancy.
On the other hand, labour and political participation presuppose neither of these sets of variables - they are almost completely uncorrelated with them. This indicates that women participate in the labour force and in parliament even in poorer countries with more traditional conceptions of child-rearing and worse health care. It perhaps also indicates that most countries perform suboptimally when it comes to the participation of women in politics and the labour force. As such, these two variables are poor predictors of development in terms of income, education or health.
The tea dataset, according to the information provided by ?tea, includes three sets of variables: the first 18 relate to tea-drinking habits, the next 5 map personal details (with age featuring twice, once as an integer and once as a categorical variable with five categories), and the final 13 map product perception. For this section I have chosen to divide the variables into two sets: the active habit variables, which are fed into the MCA, and the supplementary personal-detail variables, which will be added as colours to the plot to see if we can find links between these supplementary variables and the two dimensions of variability showcased in the plot. The product-perception variables will be excluded from this analysis. Furthermore, only the categorical age variable will be included.
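The split into active and supplementary variables described above could be done along these lines. This is a sketch: the exact column selection is an assumption reconstructed from the MCA call shown later, and the tea data ships with FactoMineR:

```r
library(FactoMineR)
library(dplyr)

# Load the tea survey data bundled with FactoMineR
data(tea)

# Habit variables (columns breakfast through price) as active,
# followed by four personal-detail variables as supplementary,
# keeping only the categorical age variable (age_Q)
Tea_Who_And_How <- dplyr::select(tea, breakfast:price, sex, SPC, Sport, age_Q)
```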
Below, simple bar graphs have been created alongside the structure and dimensions of the remaining data. The reader is welcome to explore the graphs, but since their interpretation is not a requested part of this analysis, a few short comments should suffice:
- There is a clear distinction between places where tea is drunk (home, friends) and where and when it is not (pub, restaurant, dinner, lunch, work).
- People seem to drink Earl Grey frequently.
- Slightly more women answered the survey.
## [1] 300 36
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
Moving on to the actual Multiple Correspondence Analysis. Below, the reader can first find the variable biplot of the MCA, where variables in red denote active variables and variables in green supplementary ones. It is worth noting first that the two dimensions identified by the MCA together account for only 16.7% of the principal inertia, meaning that most of the variation in the data is left unaccounted for. In fact, the summary statistic shows that increasing the number of dimensions has little effect on this, with six dimensions accounting for only 36.8% of the principal inertia.
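The MCA and its variable biplot can be reproduced roughly as follows - a sketch assuming FactoMineR and a data frame `Tea_Who_And_How` with the four personal-detail variables in columns 19 to 22, mirroring the call in the summary output:

```r
library(FactoMineR)

# MCA with the habit variables active and columns 19-22 supplementary
tea_mca <- MCA(Tea_Who_And_How, quali.sup = 19:22, graph = FALSE)

# Eigenvalues / share of principal inertia per dimension
summary(tea_mca)

# Variable biplot: hide individuals, colour categories by variable
plot(tea_mca, invisible = c("ind"), habillage = "quali")
```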
Be that as it may, we will continue to work with these results. The theory that I offer based on the information below is that dimension 1 maps what could be generalised as “tea consumption.” We see the frequency of tea-drinking increase as we move from left to right. Additionally, the variables on the right-hand side of the plot indicate higher levels of tea consumption - to the point that tearooms are visited. The left-hand side showcases answers indicating little interest in tea: the tea is cheap, bought from a chain store, and the respondents do not generally choose tea as their drink in restaurants, at home or at friends’. Dimension 2, on the other hand, maps “tea enjoyment,” or something similar. We see the higher values correspond with upscale tea bought loose from a tea shop. The lower values reflect a more basic approach to tea. But as said, even if correct, these dimensions and their interpretations do not explain much of the variance in the data.
Consequently, it is not surprising that when we colour the plots to map the respondents by their membership of the categories of the four supplementary variables, we have a hard time identifying any significant linkages. Age does not seem to be linked to either of the dimensions identified. With sports, there is some indication that individuals who do not take part in sport are also lesser tea drinkers, but the evidence for this is weak. Similarly, when it comes to employment status, students and the elderly seem to drink more tea. Perhaps the clearest linkage is between gender and dimension 1, where women are more prominent on the right-hand side of the scale, perhaps indicating that women tend to be bigger casual tea drinkers, with men more often opting for other drinks, or drinking finer teas, as indicated by the higher number of men at the top of dimension 2.
Overall, these plots are not very helpful, primarily due to their low ability to account for variation in the data, and, perhaps consequently, due to the absence of other linkages that they might have identified.
##
## Call:
## MCA(X = Tea_Who_And_How, quanti.sup = NULL, quali.sup = 19:22,
## graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.148 0.116 0.094 0.078 0.073 0.072 0.067
## % of var. 9.392 7.335 5.972 4.957 4.635 4.534 4.233
## Cumulative % of var. 9.392 16.727 22.699 27.655 32.291 36.824 41.057
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.065 0.060 0.059 0.058 0.055 0.052 0.050
## % of var. 4.106 3.829 3.719 3.672 3.500 3.319 3.193
## Cumulative % of var. 45.163 48.992 52.711 56.383 59.883 63.202 66.395
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.049 0.046 0.045 0.044 0.039 0.037 0.035
## % of var. 3.094 2.932 2.841 2.785 2.480 2.330 2.210
## Cumulative % of var. 69.488 72.421 75.262 78.047 80.527 82.857 85.067
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 0.034 0.034 0.029 0.028 0.027 0.026 0.024
## % of var. 2.149 2.132 1.860 1.755 1.690 1.637 1.529
## Cumulative % of var. 87.216 89.348 91.208 92.963 94.653 96.290 97.820
## Dim.29 Dim.30
## Variance 0.019 0.016
## % of var. 1.196 0.984
## Cumulative % of var. 99.016 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.573 0.739 0.160 | -0.182 0.095 0.016 | -0.348 0.427
## 2 | -0.381 0.326 0.139 | -0.127 0.047 0.016 | -0.604 1.288
## 3 | 0.095 0.020 0.006 | -0.143 0.059 0.013 | 0.303 0.325
## 4 | -0.634 0.904 0.281 | 0.009 0.000 0.000 | 0.141 0.070
## 5 | -0.138 0.043 0.023 | -0.100 0.029 0.012 | -0.105 0.039
## 6 | -0.734 1.212 0.265 | 0.018 0.001 0.000 | -0.027 0.003
## 7 | -0.084 0.016 0.008 | -0.158 0.072 0.027 | -0.081 0.023
## 8 | -0.262 0.154 0.054 | -0.022 0.001 0.000 | -0.032 0.004
## 9 | 0.198 0.088 0.033 | 0.179 0.092 0.027 | -0.593 1.244
## 10 | 0.344 0.265 0.081 | 0.436 0.546 0.130 | -0.461 0.752
## cos2
## 1 0.059 |
## 2 0.350 |
## 3 0.058 |
## 4 0.014 |
## 5 0.013 |
## 6 0.000 |
## 7 0.007 |
## 8 0.001 |
## 9 0.295 |
## 10 0.146 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## breakfast | 0.231 0.908 0.049 3.835 | -0.221 1.062 0.045
## Not.breakfast | -0.213 0.838 0.049 -3.835 | 0.204 0.980 0.045
## Not.tea time | -0.483 3.622 0.181 -7.360 | 0.116 0.269 0.011
## tea time | 0.375 2.808 0.181 7.360 | -0.090 0.209 0.011
## evening | 0.331 1.332 0.057 4.133 | -0.013 0.003 0.000
## Not.evening | -0.173 0.696 0.057 -4.133 | 0.007 0.001 0.000
## lunch | 0.714 2.656 0.088 5.121 | -0.352 0.824 0.021
## Not.lunch | -0.123 0.457 0.088 -5.121 | 0.060 0.142 0.021
## dinner | -0.795 1.568 0.048 -3.769 | 0.827 2.174 0.051
## Not.dinner | 0.060 0.118 0.048 3.769 | -0.062 0.164 0.051
## v.test Dim.3 ctr cos2 v.test
## breakfast -3.665 | -0.621 10.329 0.356 -10.315 |
## Not.breakfast 3.665 | 0.573 9.535 0.356 10.315 |
## Not.tea time 1.773 | 0.158 0.607 0.019 2.403 |
## tea time -1.773 | -0.122 0.471 0.019 -2.403 |
## evening -0.169 | 0.510 4.994 0.136 6.383 |
## Not.evening 0.169 | -0.267 2.611 0.136 -6.383 |
## lunch -2.520 | 0.356 1.037 0.022 2.552 |
## Not.lunch 2.520 | -0.061 0.178 0.022 -2.552 |
## dinner 3.922 | 0.637 1.586 0.031 3.023 |
## Not.dinner -3.922 | -0.048 0.119 0.031 -3.023 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.049 0.045 0.356 |
## tea.time | 0.181 0.011 0.019 |
## evening | 0.057 0.000 0.136 |
## lunch | 0.088 0.021 0.022 |
## dinner | 0.048 0.051 0.031 |
## always | 0.060 0.003 0.053 |
## home | 0.013 0.000 0.131 |
## work | 0.099 0.049 0.002 |
## tearoom | 0.344 0.017 0.000 |
## friends | 0.208 0.011 0.160 |
##
## Supplementary categories (the 10 first)
## Dim.1 cos2 v.test Dim.2 cos2 v.test Dim.3 cos2
## F | 0.168 0.041 3.511 | -0.114 0.019 -2.375 | -0.045 0.003
## M | -0.245 0.041 -3.511 | 0.166 0.019 2.375 | 0.066 0.003
## employee | -0.153 0.006 -1.305 | -0.145 0.005 -1.243 | 0.064 0.001
## middle | -0.010 0.000 -0.071 | 0.312 0.015 2.114 | -0.330 0.017
## non-worker | 0.005 0.000 0.045 | 0.175 0.008 1.580 | -0.254 0.017
## other worker | -0.030 0.000 -0.140 | 0.009 0.000 0.041 | 0.063 0.000
## senior | 0.370 0.018 2.325 | 0.047 0.000 0.297 | -0.106 0.001
## student | 0.031 0.000 0.299 | -0.279 0.024 -2.664 | 0.391 0.047
## workman | -0.454 0.009 -1.601 | 0.216 0.002 0.764 | 0.061 0.000
## Not.sportsman | -0.031 0.001 -0.444 | 0.016 0.000 0.222 | -0.036 0.001
## v.test
## F -0.942 |
## M 0.942 |
## employee 0.545 |
## middle -2.238 |
## non-worker -2.285 |
## other worker 0.292 |
## senior -0.667 |
## student 3.732 |
## workman 0.217 |
## Not.sportsman -0.518 |
##
## Supplementary categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## sex | 0.041 0.019 0.003 |
## SPC | 0.029 0.044 0.066 |
## Sport | 0.001 0.000 0.001 |
## age_Q | 0.008 0.066 0.103 |
The RATS dataset measures the body weight of three groups of rats at various points in time. The groups are differentiated by the diets they were fed, and their weight is measured in grams. Below, the reader can find the RATS dataset in wide and long form. The difference between these two forms of data is covered in the data-wrangling exercise.
## X ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1 1 1 1 240 250 255 260 262 258 266 266 265 272 278
## 2 2 2 1 225 230 230 232 240 240 243 244 238 247 245
## 3 3 3 1 245 250 250 255 262 265 267 267 264 268 269
## 4 4 4 1 260 255 255 265 265 268 270 272 274 273 275
## 5 5 5 1 255 260 255 270 270 273 274 273 276 278 280
## 6 6 6 1 260 265 270 275 275 277 278 278 284 279 281
## 7 7 7 1 275 275 260 270 273 274 276 271 282 281 284
## 8 8 8 1 245 255 260 268 270 265 265 267 273 274 278
## 9 9 9 2 410 415 425 428 438 443 442 446 456 468 478
## 10 10 10 2 405 420 430 440 448 460 458 464 475 484 496
## 11 11 11 2 445 445 450 452 455 455 451 450 462 466 472
## 12 12 12 2 555 560 565 580 590 597 595 595 612 618 628
## 13 13 13 3 470 465 475 485 487 493 493 504 507 518 525
## 14 14 14 3 535 525 530 533 535 540 525 530 543 544 559
## 15 15 15 3 520 525 530 540 543 546 538 544 553 555 548
## 16 16 16 3 510 510 520 515 530 538 535 542 550 553 569
## X ID Group WD Weight Time
## 1 1 1 1 WD1 240 1
## 2 2 2 1 WD1 225 1
## 3 3 3 1 WD1 245 1
## 4 4 4 1 WD1 260 1
## 5 5 5 1 WD1 255 1
## 6 6 6 1 WD1 260 1
## 7 7 7 1 WD1 275 1
## 8 8 8 1 WD1 245 1
## 9 9 9 2 WD1 410 1
## 10 10 10 2 WD1 405 1
## 11 11 11 2 WD1 445 1
## 12 12 12 2 WD1 555 1
## 13 13 13 3 WD1 470 1
## 14 14 14 3 WD1 535 1
## 15 15 15 3 WD1 520 1
## 16 16 16 3 WD1 510 1
## 17 17 1 1 WD8 250 8
## 18 18 2 1 WD8 230 8
## 19 19 3 1 WD8 250 8
## 20 20 4 1 WD8 255 8
## 21 21 5 1 WD8 260 8
## 22 22 6 1 WD8 265 8
## 23 23 7 1 WD8 275 8
## 24 24 8 1 WD8 255 8
## 25 25 9 2 WD8 415 8
## 26 26 10 2 WD8 420 8
## 27 27 11 2 WD8 445 8
## 28 28 12 2 WD8 560 8
## 29 29 13 3 WD8 465 8
## 30 30 14 3 WD8 525 8
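The wide-to-long conversion shown above can be sketched with dplyr and tidyr - the object and column names are assumptions based on the tables:

```r
library(dplyr)
library(tidyr)

# Gather the WD1...WD64 columns into one Weight column,
# keeping the measurement day label in WD
RATS_Long <- RATS %>%
  pivot_longer(cols = starts_with("WD"),
               names_to = "WD", values_to = "Weight") %>%
  # Extract the day number from the label, e.g. "WD8" -> 8
  mutate(Time = as.integer(substr(WD, 3, nchar(WD)))) %>%
  arrange(Time, ID)
```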
To initiate the exploration of the data, we can run it through some graphical tools. As MABS chapter 8 notes, these graphical approaches, together with the summary measure approach, provide a quick-and-dirty way of analysing the data. We will begin by handling the dataset graphically to extract information.
The groups showcase the “tracking” effect, whereby larger starting values predict larger ending values. This effect is a probabilistic predictor: a larger starting value does not determine a larger ending value, it merely makes it more probable. As such, larger starting values do not always translate to larger ending values - but most often they do.
To really bring out the effects of tracking, we need to standardize the data. Understanding the effects of tracking is important, because otherwise we are unable to separate the effects of the diet from the effects of the starting size.
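Standardizing within each timepoint removes the effect of overall size at that timepoint; a sketch assuming the long-form data frame `RATS_Long`:

```r
library(dplyr)

# Standardize weight within each measurement day:
# subtract the day's mean and divide by the day's sd
RATS_Long_Std <- RATS_Long %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()
```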
The above graphs showcase the effects of both tracking and time on the dataset. In standardized form, we are able to see that the varying diets have little effect on the weights of the rats. This means that starting size and age-related growth explain much of the change. In fact, only group 2 (if we ignore the flat straight line) shows compelling signs of growth brought on by the diet alone. Groups 1 and 3 seem to follow the standard pattern.
Moving on to the summary graph, we summarize the data through the means of each group.
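Such a summary plot of group means (with standard errors) could be built along these lines, assuming the long-form data frame `RATS_Long`:

```r
library(dplyr)
library(ggplot2)

# Mean and standard error of weight for each group at each timepoint
RATS_Summary <- RATS_Long %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight),
            se = sd(Weight) / sqrt(n()),
            .groups = "drop")

ggplot(RATS_Summary, aes(x = Time, y = mean,
                         linetype = factor(Group), shape = factor(Group))) +
  geom_line() +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.3) +
  ylab("mean(Weight) +/- se(Weight)")
```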
As we are interested in the general, or mean, effect of the diets on the rats, the above plot provides us with plenty of information. It shows that group 2 has the fastest growth in weight of the three groups. Group 3 has the second-fastest growth, while group 1 remains almost stagnant.
This is not, though, the end of the analysis. We need to figure out whether certain observations are outliers and consequently contribute excessively to the results. The box plot graph is a good approach to determine this.
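A boxplot of each rat's mean weight over the follow-up makes potential outliers visible; a sketch assuming `RATS_Long`:

```r
library(dplyr)
library(ggplot2)

# Mean weight over the whole measurement period for each rat
RATS_Means <- RATS_Long %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(Weight), .groups = "drop")

# Boxplot per group, with the group mean marked by a diamond
ggplot(RATS_Means, aes(x = factor(Group), y = mean)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3) +
  xlab("Group") +
  ylab("mean(Weight) over the measurement period")
```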
As is clear from the data, we can observe an outlier in each group. In groups 1 and 3 the outlier pulls the average down, whereas in group 2 it raises it. There are certain situations where this might be problematic - such as when measuring the weights of the rats at specific points in time - but if we are interested in the growth rates (as is the case here), these outliers do not seem to have a significant effect. This is because their variance is stable throughout the time frame, so the outliers do not contribute to the difference in change of outcome between the groups.
In the book, the summary measure applied to the data is the difference in the means of the groups. This is a good approach for the BPRS data, because it is supposed to track the effectiveness of a treatment. As the RATS data measures the effect of a diet on the growth of rats, we would be better served by using the regression coefficient as the summary measure - as per MABS chapter 8, page 162. Nevertheless, this comes so close to the work done in chapter 9 that despite the poorer fit of the mean summary measure, we will apply it here to the RATS data. As such, we will treat the diet as a one-off measure. Conceptually, this approach could be justified - for example - when measuring the effects of the diet of adolescent rats on their development into mature, fully grown rats. In such a case, the effects are cemented in - as if we were talking about the effects of a treatment. Nevertheless, there is no reason to ignore the baseline in this case, and as such that part will not be carried out.
What the above graph shows is that the variance in the results of groups 1 and 3 is negligible, while group 2’s variance is somewhat larger. Also, as noted above, each group has an outlier, but only group 2 has an outlier that is clearly raising the group mean. Accordingly, as we are examining the means, the second graph showcases the groups with that clear outlier removed. We can see that after the removal of the outlier, the groups have very similar profiles in relation to their means; only their levels differ, with group 1 having the smallest, group 3 the largest, and group 2 in between. With knowledge of tracking, these results do not really tell us much about the diet - they simply indicate that the rats were of different sizes to start with. What we can additionally do is plot the average change in weight for each group from the start of the measurement period to its end.
As is clear from these graphs, each group experiences growth, but as the second graph shows, only group 2 experiences above-average growth, with group 1 following the standardized growth and group 3, in fact, falling behind. This could - theoretically - be due to the diets. To get more formal statistical evidence of this relation, we can run the data through a t-test.
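Each group's first-versus-last comparison can be run as a two-sample t-test; a sketch assuming `RATS_Long`, shown here for group 1 (the other groups follow the same pattern):

```r
library(dplyr)

# First (day 1) and last (day 64) measurements for group 1
G1 <- RATS_Long %>% filter(Group == 1, Time %in% c(1, 64))

# Classic two-sample t-test of weight between the two days
t.test(Weight ~ Time, data = G1, var.equal = TRUE)
```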
## stdweight Weight Time Group
## Min. :0.3105 Min. :405.0 Min. : 1.0 1:0
## 1st Qu.:0.3844 1st Qu.:418.8 1st Qu.: 1.0 2:6
## Median :0.5083 Median :458.5 Median :32.5 3:0
## Mean :0.4941 Mean :451.0 Mean :32.5
## 3rd Qu.:0.6038 3rd Qu.:476.5 3rd Qu.:64.0
## Max. :0.6587 Max. :496.0 Max. :64.0
##
## Two Sample t-test
##
## data: Weight by Time
## t = -3.3271, df = 14, p-value = 0.004986
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -38.032466 -8.217534
## sample estimates:
## mean in group 1 mean in group 64
## 250.625 273.750
##
## Two Sample t-test
##
## data: Weight by Time
## t = -4.275, df = 4, p-value = 0.0129
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -102.26643 -21.73357
## sample estimates:
## mean in group 1 mean in group 64
## 420 482
##
## Two Sample t-test
##
## data: Weight by Time
## t = -2.4693, df = 6, p-value = 0.0485
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -82.6240161 -0.3759839
## sample estimates:
## mean in group 1 mean in group 64
## 508.75 550.25
The t-test analysis indicates rather clearly that all differences in mean outcome are statistically significant. The clearly higher statistical significance of the group 1 results is explained by the higher number of observations, which allows a more reliable observation of true differences.
The incorporation of baseline measures is not done in this analysis, as the focus is on the change in weight - as opposed to the actual weights. The former is not affected by the higher starting weights. Furthermore, as there are no missing values, the tricks used in MABS Chapter 8 are not necessary here.
The BPRS dataset covers 40 male subjects who were randomly assigned to one of two treatment groups for psychiatric issues. Each subject was rated with the BPRS-measure, or the “brief psychiatric rating scale (BPRS),” which is used to evaluate patients suspected of having schizophrenia. The BPRS assesses the level of 18 psychiatric issues (inter alia: hostility, suspiciousness, hallucinations and grandiosity), each of these is rated from one (not present) to seven (extremely severe). These BPRS-measures were taken before treatment began (week 0) and then weekly for eight weeks. Below we begin with tables of the BPRS dataset in wide and long form:
## X treatment subject week0 week1 week2 week3 week4 week5 week6 week7 week8
## 1 1 1 1 42 36 36 43 41 40 38 47 51
## 2 2 1 2 58 68 61 55 43 34 28 28 28
## 3 3 1 3 54 55 41 38 43 28 29 25 24
## 4 4 1 4 55 77 49 54 56 50 47 42 46
## 5 5 1 5 72 75 72 65 50 39 32 38 32
## 6 6 1 6 48 43 41 38 36 29 33 27 25
## 7 7 1 7 71 61 47 30 27 40 30 31 31
## 8 8 1 8 30 36 38 38 31 26 26 25 24
## 9 9 1 9 41 43 39 35 28 22 20 23 21
## 10 10 1 10 57 51 51 55 53 43 43 39 32
## 11 11 1 11 30 34 34 41 36 36 38 36 36
## 12 12 1 12 55 52 49 54 48 43 37 36 31
## 13 13 1 13 36 32 36 31 25 25 21 19 22
## 14 14 1 14 38 35 36 34 25 27 25 26 26
## 15 15 1 15 66 68 65 49 36 32 27 30 37
## 16 16 1 16 41 35 45 42 31 31 29 26 30
## 17 17 1 17 45 38 46 38 40 33 27 31 27
## 18 18 1 18 39 35 27 25 29 28 21 25 20
## 19 19 1 19 24 28 31 28 29 21 22 23 22
## 20 20 1 20 38 34 27 25 25 27 21 19 21
## 21 21 2 1 52 73 42 41 39 38 43 62 50
## 22 22 2 2 30 23 32 24 20 20 19 18 20
## 23 23 2 3 65 31 33 28 22 25 24 31 32
## 24 24 2 4 37 31 27 31 31 26 24 26 23
## 25 25 2 5 59 67 58 61 49 38 37 36 35
## 26 26 2 6 30 33 37 33 28 26 27 23 21
## 27 27 2 7 69 52 41 33 34 37 37 38 35
## 28 28 2 8 62 54 49 39 55 51 55 59 66
## 29 29 2 9 38 40 38 27 31 24 22 21 21
## 30 30 2 10 65 44 31 34 39 34 41 42 39
## 31 31 2 11 78 95 75 76 66 64 64 60 75
## 32 32 2 12 38 41 36 27 29 27 21 22 23
## 33 33 2 13 63 65 60 53 52 32 37 52 28
## 34 34 2 14 40 37 31 38 35 30 33 30 27
## 35 35 2 15 40 36 55 55 42 30 26 30 37
## 36 36 2 16 54 45 35 27 25 22 22 22 22
## 37 37 2 17 33 41 30 32 46 43 43 43 43
## 38 38 2 18 28 30 29 33 30 26 36 33 30
## 39 39 2 19 52 43 26 27 24 32 21 21 21
## 40 40 2 20 47 36 32 29 25 23 23 23 23
## X treatment subject weeks bprs week
## 1 1 1 1 week0 42 0
## 2 2 1 2 week0 58 0
## 3 3 1 3 week0 54 0
## 4 4 1 4 week0 55 0
## 5 5 1 5 week0 72 0
## 6 6 1 6 week0 48 0
## 7 7 1 7 week0 71 0
## 8 8 1 8 week0 30 0
## 9 9 1 9 week0 41 0
## 10 10 1 10 week0 57 0
## 11 11 1 11 week0 30 0
## 12 12 1 12 week0 55 0
## 13 13 1 13 week0 36 0
## 14 14 1 14 week0 38 0
## 15 15 1 15 week0 66 0
## 16 16 1 16 week0 41 0
## 17 17 1 17 week0 45 0
## 18 18 1 18 week0 39 0
## 19 19 1 19 week0 24 0
## 20 20 1 20 week0 38 0
## 21 21 2 1 week0 52 0
## 22 22 2 2 week0 30 0
## 23 23 2 3 week0 65 0
## 24 24 2 4 week0 37 0
## 25 25 2 5 week0 59 0
## 26 26 2 6 week0 30 0
## 27 27 2 7 week0 69 0
## 28 28 2 8 week0 62 0
## 29 29 2 9 week0 38 0
## 30 30 2 10 week0 65 0
When possible, model fitting provides a more rigorous tool for analysing data than the summary methods discussed above. Nevertheless, with longitudinal data, the key assumption of basic linear models - that observations be independent of one another - often does not hold. We will begin by showcasing this graphically. Instead of examining the data through a plot similar to the “Plot of weight against time for rat data” in chapter 9 of MABS, the progression of BPRS scores is represented here through two line graphs, each representing one of the treatment groups. I find this much more informative than the slightly confusing numbers chart referred to above.
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BP_Long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The above graphical representation would seem to provide evidence that the observations depend on the subject’s previous scores. They are not completely random, but to an extent trace a trend. This is especially clear in the graph set, which showcases how the results of each week are correlated with the results of the previous weeks.
Additionally, both the graphical representation and the basic linear model seem to indicate that treatment 1 produces better BPRS scores for the participants. Nevertheless, the graphs are not completely clear due to the large number of subjects, and the basic linear model applied above ignores the repeated-measures structure of the data. As such, it ignores the linkage of observations made on the same subject. More nuanced models can be applied. Below, one can find a gradual progression towards a more complex and better-fitting linear model.
The first summary comes from the random intercept model, which allows us to abandon the independence assumption and link the observations belonging to each subject. As such, it permits the linear regression fit for each subject to differ in intercept from the other subjects.
The next summary is from the random intercept and random slope model, which not only allows each subject to have their own intercept, but also their own slope. This makes it possible to incorporate the different progressions of the subjects and the effect of time.
Finally, the third summary is from the random intercept and random slope model with a week-by-treatment interaction.
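The three models described above can be fitted with lme4; the calls below mirror the formulas and object names in the output:

```r
library(lme4)

# Random intercept: each subject gets its own baseline level
BP_ri <- lmer(bprs ~ week + treatment + (1 | subject),
              data = BP_Long, REML = FALSE)

# Random intercept and random slope: baseline and time trend vary by subject
BP_rirs <- lmer(bprs ~ week + treatment + (week | subject),
                data = BP_Long, REML = FALSE)

# As above, plus the week-by-treatment interaction
BP_rirsX <- lmer(bprs ~ week * treatment + (week | subject),
                 data = BP_Long, REML = FALSE)

summary(BP_ri)
summary(BP_rirs)
summary(BP_rirsX)
```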
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BP_Long
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BP_Long
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BP_Long
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
As we are interested in the model that provides the best representation of the data, we will use ANOVA and examine the resulting p-values of the chi-squared test. The smaller the value, the better the model compared to the model it is contrasted with. The results of the ANOVAs can be found below:
## Data: BP_Long
## Models:
## BP_ri: bprs ~ week + treatment + (1 | subject)
## BP_rirs: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BP_ri 5 2748.7 2768.1 -1369.4 2738.7
## BP_rirs 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Data: BP_Long
## Models:
## BP_rirs: bprs ~ week + treatment + (week | subject)
## BP_rirsX: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BP_rirs 7 2745.4 2772.6 -1365.7 2731.4
## BP_rirsX 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Data: BP_Long
## Models:
## BP_ri: bprs ~ week + treatment + (1 | subject)
## BP_rirsX: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BP_ri 5 2748.7 2768.1 -1369.4 2738.7
## BP_rirsX 8 2744.3 2775.4 -1364.1 2728.3 10.443 3 0.01515 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The smallest p-value is obtained between the basic random intercept model and the random intercept and random slope model that allows for a week-by-treatment interaction. In fact, each move towards a more complicated model produces an improvement in fit. The improvement from the general random intercept and random slope model to the interaction model is not statistically significant, but the interaction model is still the preferred one, as it provides the best fit overall, with the highest chi-squared score and the lowest p-value. As such, going forward, we will apply this model in our analysis.
In the graphical representation below, we can see the applicability of our fitted model to the observed BPRS scores. The colours indicate the subject within the treatment group.
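The fitted curves can be drawn by attaching the model’s fitted values to the data; a sketch assuming the interaction model object `BP_rirsX`:

```r
library(dplyr)
library(ggplot2)

# Append the fitted BPRS values from the mixed model
BP_Long_Fit <- BP_Long %>% mutate(Fitted = fitted(BP_rirsX))

# One line per subject, one panel per treatment group
ggplot(BP_Long_Fit, aes(x = week, y = Fitted, colour = factor(subject))) +
  geom_line() +
  facet_wrap(~ treatment, labeller = label_both) +
  theme(legend.position = "none")
```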
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BP_Long
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
As can be seen from the above graphical representation, the fitted model provides a decent fit, but not a perfect one. What becomes clear from it, though, is that both treatments seem to produce an effect on the BPRS scores. The general trend is downward, with treatment 1 producing slightly lower scores. As the above repetition of the summary shows, the t-value for the effect of time is high enough to indicate statistical significance. The correlation of the fixed effects indicates a moderately strong negative correlation.
In the case of the treatment type, the correlation of fixed effects gives us an intermediate negative correlation moving from treatment 1 to treatment 2. Nevertheless, the t-value does not provide sufficient evidence of statistical significance. As such, the null hypothesis - that treatment type has no effect - is not disproved here.